ABSTRACT


Automatically detecting bugs in student program code is
critical to enable formative feedback that helps students
pinpoint errors and resolve them. Deep learning models,
especially code2vec and ASTNN, have shown great success
for large-scale code classification. It is not clear, however,
whether they can be effectively used for bug detection when 
the amount of labeled data is limited. In this work, we in- 
vestigated the effectiveness of code2vec and ASTNN against 
classic machine learning models by varying the amount of 
labeled data from 1% up to 100%. With a few exceptions, 
the two deep learning models outperform the classic mod- 
els. More interestingly, our results showed that when the 
amount of labeled data is small, code2vec is more effective, 
while ASTNN is more effective with more training data; for 
both code2vec and ASTNN, the more labeled data, the better.
To further improve their effectiveness, we investigated
the potential of semi-supervised learning, which can leverage
a large amount of unlabeled data to improve their
performance. Our results showed that semi-supervised learning
is indeed beneficial, especially for ASTNN.
1. INTRODUCTION AND BACKGROUND


When students encounter difficulties during programming,
they are often caused by systemic procedural errors, or “bugs”
[9], which can occur repeatedly across problems [38, 8]. For
example, a student may confuse when it is appropriate to
use the and and or operators, or fail to consider a boundary
case in a condition, using > instead of >= [17]. These bugs
are rarely directly addressed by the compiler or test-case
feedback employed in most computer science (CS) courses,
which is generally limited to reporting syntax errors or
which correct input-output pairs the program fails to
replicate. Historically, tutoring systems in a variety of
learning domains have detected these bugs automatically (e.g.
through a bug library [9, 4]). The detection can be used
to offer tailored formative feedback [34] that addresses bugs
directly [22], and can also help instructors to be more
informed about students’ learning process [25]. The detection
of bugs often requires experts’ manual definitions, with
distinct rules for detecting the bug on different problems [4].
This can make bug detection impractical to deploy.
Most current automatic grading systems for student code are
mainly based on test cases, which provide a score and failed
test case information to students [15, 16, 37]. Nevertheless,
the relationship between code’s output and the presence of
specific bugs in student code is not clear, since a given
erroneous output could be caused by various errors in student
code. An automatic bug detection system for student code
could be useful to fill in the gaps for students.


*The first two authors contributed to the manuscript
equally.


Yang Shi, Ye Mao, Tiffany Barnes, Min Chi and Thomas Price “More With
Less: Exploring How to Use Deep Learning Effectively through
Semi-supervised Learning for Automatic Bug Detection in Student Code”. 2021.
In: Proceedings of The 14th International Conference on Educational Data
Mining (EDM21). International Educational Data Mining Society,
446-453. https://educationaldatamining.org/edm2021/

EDM ’21 June 29 - July 02 2021, Paris, France


Machine learning (ML) algorithms are powerful tools for 
data analysis, which have been commonly used for auto- 
matic programming code analysis [10]. Classical machine 
learning methods, such as support vector machines [13] and 
XGBoost [11], are capable of classifying program code [12, 
21, 18]. Recent advances in machine learning have lever- 
aged structural information in code to accurately classify 
and label it [2, 3, 41, 28]. For example, Alon et al. explored 
path representations on code represented as trees [2], and 
designed the code2vec model to learn the representations 
using deep neural networks [3]. Abstract Syntax Tree based 
Neural Network (ASTNN) by Zhang et al. applied recursive 
neural networks in the structure, outperforming Tree-based 
Convolutional Neural Networks [28] and other state-of-the- 
art models [41]. 


However, to apply the models to detect student program 
bugs, two challenges need to be addressed. First, these deep 
models were originally designed for professional programs 
which are fundamentally different from code written by
students [39]. Some recent work has applied these techniques to
educational domains [33, 19, 26, 30, 6], but these studies either
used base models from years earlier [28, 19], or did not
specifically target bug detection [33, 26, 30, 6]. Second, deep
learning models are traditionally “data hungry” [1], using
large, labeled training datasets (e.g. [19] was trained on
270k samples).
However, in most educational settings, datasets can be much 
smaller (e.g. ~100 students), and labeling (e.g. to identify 
bugs) can take extensive expert effort [14]. This suggests the 
potential of leveraging a semi-supervised learning strategy, 
using a mixture of labeled and unlabeled data [42]. Semi- 
supervised learning, such as the Expectation-Maximization 
(EM) method, uses unlabeled data for model improvement 
[42]. However, studies show that using unlabeled
data does not always help [35]. Thus, as recent studies
suggest [29], an empirical evaluation is needed to investigate
whether semi-supervised learning with unlabeled data
actually helps.


To address these challenges, in this paper, we evaluate two 
state-of-the-art deep learning methods: code2vec [3] and 
ASTNN [41], on the task of automatically detecting pro- 
gramming bugs in student code. We manually labeled three 
bugs in ~1800 code submissions from 410 students in a Java 
programming course, where each bug occurred in 4-6 distinct 
problems. Our results show that, when using all available 
training data, the ASTNN model performs best at detect- 
ing all three bugs, outperforming code2vec and two classical 
baseline models (support vector machines and XGBoost). 


Furthermore, we investigate how the amount of labeled data
affects these models, and whether a semi-supervised learning
approach can improve code2vec and ASTNN performance
without requiring additional labeled data. More
specifically, we investigated how the deep and baseline models
performed with different amounts of labeled training data
through a “cold start” analysis [32]. We found that all
models benefited from more data. However, despite deep models’
reputation as “data hungry,” we found the top-performing
model was generally a deep model, regardless of training
data size. Which deep model performed best depended
on the data size, with code2vec outperforming ASTNN when
less labeled data was available. For code2vec, we also found
that it required very little data (5%) to achieve 80% of its
peak F1 score. We also found that semi-supervised learning
generally improves both code2vec and ASTNN by using
unlabeled data. This effect was most consistent for ASTNN,
where semi-supervised learning improved the F1 score by
0.05 to 0.2 on all splits.


The major contributions of this paper are addressing three 
research questions (RQs): 


• RQ1: How well do state-of-the-art deep learning models
for programming code perform in a student bug
detection task?


• RQ2: How are deep learning models’ performance
impacted by the amount of available training data?


• RQ3: To what extent does semi-supervised learning
improve the performance of the deep learning models?


2. APPROACHES 


In this section, we introduce how we build code2vec and 
ASTNN for program classification; and how we applied the 
semi-supervised learning strategy on them to leverage unla- 
beled data. 
Figure 1: Code2vec model structure: the model takes a set
of paths as input, passes them through embedding layers and an
attention layer, then detects whether the input code has bugs
(1) or not (0).


2.1 Code2vec 


One primary technical challenge in applying machine learn- 
ing to program code lies in code representation. Code is 
often represented using an abstract syntax tree (AST) [7], 
while most learning algorithms expect a fixed-length vec- 
tor. To solve this issue, sub-components of the ASTs are 
used as inputs for deep learning models. In the case of the 
code2vec model, it learns a code embedding through leaf- 
to-leaf paths, represented as strings. Strings of nodes and 
paths are mapped into numbers by tokenizers, where differ- 
ent strings are mapped into different numbers. These num- 
bers are used as the input of a code2vec model, shown in 
Figure 1. Assume we have a code snippet that produces R
paths $(p_0, \ldots, p_r)$ to be fed into the code2vec model. These
numbers are embedded into e-dimensional vectors through
node and path embedding layers ($W_{enode}$ and $W_{epath}$),
respectively, and the node and path vectors are concatenated
into one vector for each of the paths $(e_0, \ldots, e_r)$.
These vectors form a matrix $E \in \mathbb{R}^{e \times R}$. Then
these path vectors pass through a soft attention layer $W_a$
[40], which calculates the soft attention weight $a$ for
each of the paths:

$$a = \mathrm{SoftMax}(W_a^{\top} E), \quad W_a \in \mathbb{R}^{e \times 1},$$

and thus $a$ holds a scalar weight $a_r$ for each of the paths,
normalized by the SoftMax operator. The embedded path
vectors $E$ are then combined with the calculated attention
weights through a dot product, so that the weights show which
paths are more important in a code snippet. The resulting
weighted average vector passes through
two fully-connected layers to make the bug classification.
In the training process, all the W weights are updated using
the Adam [23] optimization algorithm, while in the evaluation
and validation processes, the model weights are not
changed.
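The attention step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy sizes, the random initialization, and the names `softmax` and `attention_pool` are assumptions made only for illustration, and the embedding and fully-connected layers are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attention_pool(path_vectors, w_a):
    """Return (weights, code_vector) for one code snippet.

    path_vectors: list of path vectors (the columns of E), each of length e.
    w_a: attention vector W_a of length e.
    """
    # One scalar score per path: W_a^T e_r
    scores = [sum(w * x for w, x in zip(w_a, vec)) for vec in path_vectors]
    weights = softmax(scores)
    # Attention-weighted sum of the path vectors
    code_vector = [
        sum(a * vec[j] for a, vec in zip(weights, path_vectors))
        for j in range(len(w_a))
    ]
    return weights, code_vector


random.seed(0)
e, R = 4, 3  # toy sizes, chosen only for illustration
E = [[random.uniform(-1, 1) for _ in range(e)] for _ in range(R)]
W_a = [random.uniform(-1, 1) for _ in range(e)]
weights, code_vec = attention_pool(E, W_a)
```

Because the weights come out of a softmax, they are positive and sum to one, so the pooled vector is a weighted average of the path vectors.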


2.2 ASTNN 

Different from the path-based inputs for code2vec, ASTNN
utilizes statement-level ASTs to learn a vector for the
code. Specifically, we split the large AST of a code fragment
at the granularity of statements and extract the sequence
of statement trees (ST-trees) with a pre-order traversal, and
feed them as the raw input of ASTNN. Suppose that we
have a set of ST-trees $(s_1, s_2, \ldots, s_T)$; our goal is to learn a
vector representation $z$ for the original code. The detailed
architecture of ASTNN is shown in Figure 2.


Figure 2: ASTNN model structure: the model takes a set of
statement trees as input, passes them through an encoder layer,
a Bi-GRU layer, and a max-pooling layer, then detects whether
the input code has bugs (1) or not (0).


Statement Encoder: Each ST-tree is composed of a root
node and its child indices from a limited vocabulary of up to
V symbols. For an ST-tree $s_i$, we first represent all nodes with
the pre-trained embedding matrix $W_{embed} \in \mathbb{R}^{V \times d}$, where
V is the vocabulary size and d is the embedding dimension.
Thus the initial vector of a node n can be obtained by:


$$v_n = W_{embed}^{\top} x_n \qquad (1)$$


where $x_n$ is the one-hot encoding of node n. Next the ST-tree
goes through a Recursive Neural Network [36] based
encoder layer to update the vector for each node:


$$h_n = \sigma\Big(W_{encode}^{\top} v_n + \sum_{i \in child} h_i + b_n\Big) \qquad (2)$$


where $W_{encode} \in \mathbb{R}^{d \times k}$ is the encoding matrix and k is the
encoding dimension, $v_n$ is obtained from Equation 1, and $b_n$
is the bias term. $\sigma$ is the activation function, and in this work
we followed the original paper and set $\sigma$ to the identity function.
After recursively computing the vectors of all nodes in
the ST-tree, we sample the final representation $e_i$ for $s_i$ via
a max-pooling layer.
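As a rough illustration of Equations 1 and 2 followed by max-pooling, the sketch below encodes one toy ST-tree in plain Python. The tree encoding as `(token_id, children)` tuples, the tiny matrices, and the function names are assumptions for illustration only; the activation is the identity, as stated above.

```python
def encode_node(node, w_embed, w_encode, bias, collected):
    """Compute h_n for a node (Equation 2) and record it in `collected`."""
    token_id, children = node
    # Row lookup is equivalent to W_embed^T x_n for a one-hot x_n (Equation 1)
    v = w_embed[token_id]
    k = len(bias)
    # W_encode^T v_n + b_n, with identity activation
    h = [
        sum(v[r] * w_encode[r][c] for r in range(len(v))) + bias[c]
        for c in range(k)
    ]
    # Add the hidden vectors of all children
    for child in children:
        child_h = encode_node(child, w_embed, w_encode, bias, collected)
        h = [a + b for a, b in zip(h, child_h)]
    collected.append(h)
    return h


def encode_statement(tree, w_embed, w_encode, bias):
    """Max-pool over all node vectors of one ST-tree to obtain e_i."""
    collected = []
    encode_node(tree, w_embed, w_encode, bias, collected)
    return [max(column) for column in zip(*collected)]


# Toy setup: V = 3 symbols, d = 2, k = 2 (hypothetical values)
w_embed = [[1.0, -2.0], [0.0, 1.0], [1.0, 1.0]]
w_encode = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
tree = (0, [(1, []), (2, [])])  # root token 0 with two leaf children
statement_vector = encode_statement(tree, w_embed, w_encode, bias)
```

Here the leaves yield [0, 1] and [1, 1], the root yields [1, -2] + [0, 1] + [1, 1] = [2, 0], and the element-wise maximum over the three node vectors gives [2, 1].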


Code Representation: Based on the sequence of ST-tree
vectors, a bidirectional GRU [5] is applied to track the
naturalness of the statement sequence $(e_1, e_2, \ldots, e_T)$, where T is the
number of ST-trees in the AST:


$$h_i = [\overrightarrow{\mathrm{GRU}}(e_i), \overleftarrow{\mathrm{GRU}}(e_i)], \quad i \in [1, T] \qquad (3)$$


The statement representation is $h \in \mathbb{R}^{T \times 2m}$, where m is the
embedding dimension of the GRU. Finally, similar to the Statement
Encoder, a max-pooling layer is used to sample the most
important features on each of the embedding dimensions. Thus
we get $z \in \mathbb{R}^{2m}$, which is treated as the final vector
representation of the original code fragment. The $z$ vectors then
pass through a linear layer to make the final classification of
the bugs.


2.3. Semi-supervised Learning Strategy 

While we explore the potential of machine learning mod- 
els using insufficient labeled data as training inputs, unla- 
beled data can also serve as an important resource for the 
models to learn the structure of code. We applied a semi- 
supervised learning strategy to utilize these unlabeled data 
to help the model update. Specifically, in our experiments, 
we used Expectation-Maximization (EM) method [42] as an 
exploratory attempt. 


The EM method is iterative, and each iteration contains two
steps: 1) In the expectation step, the model runs inference on
the unlabeled dataset, producing a probability score for each of
the unlabeled code snippets, which serves as the pseudo-label
in the next step; 2) In the maximization step, the
model is retrained using all of the labeled training dataset and
the unlabeled set with the pseudo-labels from the expectation
step. After retraining, the model is used for the next round
of the expectation step. In our case, deep learning models are
designed to output probability scores, but SVM and XGBoost
models make classifications without clear scores or
probabilities. We therefore implemented the regression versions
of these models, treating their continuous regression output
as a probability. We then used 0.5 as the
probability threshold to binarize the output into classification
results. Every model uses the same 10 iterations of
EM steps, assuming the models are able to converge after a
certain number of iterations and retraining.
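The iteration described above can be sketched as a short loop. This is a hedged illustration rather than the authors' code: `CentroidModel` is a hypothetical one-dimensional stand-in for code2vec or ASTNN, the toy data are invented, and binarizing the pseudo-labels at 0.5 before retraining is an assumption of this sketch.

```python
import math


class CentroidModel:
    """Toy 1-D classifier standing in for a real model (hypothetical)."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        # Decision boundary halfway between the two class means
        self.mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict_proba(self, xs):
        # Logistic score of the distance to the boundary
        return [1 / (1 + math.exp(-(x - self.mid))) for x in xs]


def em_self_training(model, x_lab, y_lab, x_unlab, iterations=10, threshold=0.5):
    """Alternate pseudo-labeling (E-step) and retraining (M-step)."""
    model.fit(x_lab, y_lab)
    for _ in range(iterations):
        # Expectation: score the unlabeled data and derive pseudo-labels
        probs = model.predict_proba(x_unlab)
        pseudo = [1 if p >= threshold else 0 for p in probs]
        # Maximization: retrain on labeled plus pseudo-labeled data
        model.fit(x_lab + x_unlab, y_lab + pseudo)
    return model


x_labeled, y_labeled = [0.0, 4.0], [0, 1]
x_unlabeled = [0.5, 1.0, 3.0, 3.5, 5.0]
trained = em_self_training(CentroidModel(), x_labeled, y_labeled, x_unlabeled)
```

In this toy run the pseudo-labels stabilize after the first iteration, which mirrors the assumption that a fixed number of iterations is enough for convergence.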


3. EXPERIMENT SETTINGS 
3.1 Dataset and Bug Labeling 


We performed bug classification on a publicly available dataset,
collected from an entry-level Java programming class in Spring
2019. It was collected from the CodeWorkout [15] platform
and stored in ProgSnap2 [31] format. Since the Java compiler
can already detect bugs in code that fails to compile (due
to syntax errors), and this code cannot be converted into
an AST, we excluded uncompilable code from our analysis.
We also did not use code that passed all test cases, as this
code is correct and therefore is very unlikely to have bugs.
There are 410 students, who attempted in total 50 problems
from 5 assignments. Typical solutions for these assignments
range from 10 to 20 lines of code. In order to determine the
common set of bugs across different problems, two authors
examined student code from six distinct programming problems
from the first assignment and identified common bugs
that arose. They then selected 3 prevalent ones, based on the
coverage of each bug across the problems, that were also
identified in prior CS education literature [17, 20]. This included
2 logical bugs and 1 syntax bug: comparison-off-by-one
(logical), assign-in-conditional (syntax), and and-vs-or
(logical), defined below:


comparison-off-by-one: This bug occurs if, in a condi- 
tional expression (e.g. in an if or while), the student’s 
code uses a greater/less-than comparison operator (<=, >=, 
<, >) incorrectly, and this error can be resolved by adding 
or removing the ‘=’, (e.g. < becomes <=). The direction of 
comparison (i.e. <= vs >=) should already be correct. This 
often indicates an “off by one” error, and it is contextual, de- 
pendent on the number of literals being compared. If there 
are multiple bugs, including this bug, we still count it. 


assign-in-conditional: This bug occurs if, in a condi- 
tional expression, a student uses the = assignment operator 
in their code when trying to compare a variable with another 
value, rather than the correct == comparison operator. This 
is a syntax-based bug, but it is not detected by the compiler, 
since the assignment is logically a valid operation. 


and-vs-or: This bug occurs if a student uses the logical
operator and instead of or in their code, or vice-versa, such
that the opposite operator would produce correct code. This
is also a logical bug that requires contextual information but
is easier to detect than comparison-off-by-one. Detecting it
requires the literals, but does not depend much on problem
requirements.


Table 1: Detection performance for four classifiers on three
bugs.

Bug                   | Method   | Accuracy | AUC   | Precision | Recall | F1 Score (Std)
----------------------+----------+----------+-------+-----------+--------+---------------
comparison-off-by-one | SVM      | 0.753    | 0.658 | 0.731     | 0.100  | 0.173 (0.045)
comparison-off-by-one | XGBoost  | 0.505    | 0.541 | 0.384     | 0.547  | 0.334 (0.088)
comparison-off-by-one | Code2Vec | 0.736    | 0.746 | 0.500     | 0.556  | 0.522 (0.058)
comparison-off-by-one | ASTNN    | 0.785    | 0.704 | 0.606     | 0.533  | 0.560 (0.090)
assign-in-conditional | SVM      | 0.943    | 0.959 | 0.918     | 0.627  | 0.733 (0.099)
assign-in-conditional | XGBoost  | 0.847    | 0.877 | 0.494     | 0.726  | 0.563 (0.112)
assign-in-conditional | Code2Vec | 0.917    | 0.907 | 0.725     | 0.688  | 0.672 (0.119)
assign-in-conditional | ASTNN    | 0.970    | 0.901 | 0.961     | 0.807  | 0.868 (0.094)
and-vs-or             | SVM      | 0.722    | 0.674 | 0.534     | 0.173  | 0.256 (0.078)
and-vs-or             | XGBoost  | 0.503    | 0.669 | 0.350     | 0.784  | 0.470 (0.045)
and-vs-or             | Code2Vec | 0.758    | 0.821 | 0.570     | 0.663  | 0.609 (0.078)
and-vs-or             | ASTNN    | 0.880    | 0.837 | 0.820     | 0.739  | 0.773 (0.064)


Two authors labeled the data iteratively, starting from
the same set of initial bug definitions: the two authors
first labeled 20% of the data
independently and then calculated Cohen’s Kappa scores (κ).
If, on any of the three bugs, the two authors did not achieve
a score higher than 0.8 [24], then the authors discussed
and resolved the disagreements, refined the definitions, and
continued to another round of independently labeling 10% of
the data. This process continued until the authors reached
high agreement (κ > 0.8) on all three categories of bugs,
which occurred after labeling 40% of the data. The first
two rounds of labeling did not achieve a high κ score, both
due to the low scores on the comparison-off-by-one bug,
suggesting that this bug may be more difficult for humans to
detect consistently. On the third round of labeling, the two
authors achieved κ scores of 0.81, 0.97 and 0.84 on the three
bugs. The authors then divided the remaining 60% of the data,
with each person labeling 35%, overlapping on 10% of the data
for verification. This overlapping 10% achieved κ scores of
0.78, 0.98 and 0.95, indicating moderate to near-perfect agreement
[27]. The finalized labeled dataset has an imbalanced distribution,
as only 30% of the submissions have the comparison-off-by-one
bug, 28% have the and-vs-or bug, and
13% have the assign-in-conditional bug. In total, we spent
around 20 hours and labeled 1867 code snippets from 296
students.
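For readers unfamiliar with the agreement statistic used above, Cohen's Kappa for two raters can be computed as in the following sketch. The two label lists are invented toy data, not the study's labels, and the function name is a placeholder.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for one bug (1 = bug present), invented for illustration
rater_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(rater_1, rater_2)
needs_another_round = kappa <= 0.8  # threshold used in the labeling process
```

In this toy case the raters agree on 9 of 10 items, chance agreement is 0.54, and κ is about 0.78, so another labeling round would be required under the 0.8 threshold.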


3.2 Splits in Experiments 

Since our dataset included multiple attempts from a given
student, we split our data into training and testing sets by
student. This ensured that a given student’s code showed
up in either the training or testing set, but not both. In our
experiment, we hold out 20% of the data as the test set, and the
remaining 80% is used for model generation. To check the
performance of models with limited labeled data, we further split
the 80% of data into labeled data and unlabeled data. We
use only labeled data for supervised learning, and use both
labeled and unlabeled data for the semi-supervised setting. All
these splits were stratified according to the class label and
number of submissions, ensuring that a similar proportion
of buggy/non-buggy programs were in each split. This is
necessary, since splitting by students can create very biased
distributions, especially when we only have small labeled
training sets. The stratification uses thresholds for 1) the
ratio of bugs and 2) averaged submission numbers for students
in respective bug groups. We argue that in practice,
we should be able to select a similarly representative sample
by manually checking several submissions to see if the
distribution is fundamentally different. To ensure we evaluate
model performance with fair comparisons, we created 10
different splits, generated randomly. All models use the same
training/testing splits, and average performance metrics are
reported as the results. For the semi-supervised setting, we
varied the size of labeled/unlabeled data to evaluate the
performance of models, and
all semi-supervised models have the same labeled/unlabeled
splits. Also, all models are tested on the same test sets,
regardless of the model, the amount of training data, or
supervised vs. semi-supervised training.
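The student-level split can be illustrated with a short sketch. It covers only the grouping by student; the stratification thresholds described above are omitted. The record layout, the field names `student` and `code_id`, and the choice to hold out 20% of students (rather than exactly 20% of submissions) are assumptions of this sketch.

```python
import random


def split_by_student(submissions, test_fraction=0.2, seed=0):
    """Split submissions so that each student appears in only one set."""
    students = sorted({s["student"] for s in submissions})
    rng = random.Random(seed)  # seeded so a split can be reproduced
    rng.shuffle(students)
    n_test = max(1, round(len(students) * test_fraction))
    test_students = set(students[:n_test])
    train = [s for s in submissions if s["student"] not in test_students]
    test = [s for s in submissions if s["student"] in test_students]
    return train, test


# Toy data: 10 hypothetical students with 2 submissions each
submissions = [
    {"student": f"s{i}", "code_id": f"s{i}-{j}"}
    for i in range(10)
    for j in range(2)
]
train_set, test_set = split_by_student(submissions)
```

Repeating the call with ten different seeds would give ten random splits in the spirit of the setup described above, with every model evaluated on the same splits.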


3.3. Model Settings 

SVM and XGBoost Parameters: We performed grid search
on hyperparameters for SVM and XGBoost models using
cross-validation on the training sets. In the SVM setting,
we searched linear and Radial Basis Function (RBF) kernels,
with the C parameter ranging from 0.1 to 1 in steps of 0.1.
In the XGBoost setting, we searched subsample ratios
from 0.1 to 1 in steps of 0.1, using 5
to 100 estimators in the model. To prepare numerical input,
we used TF-IDF feature extraction on the code submissions
for both models.


Code2vec and ASTNN Parameters: Since deep learning models
are more time- and resource-consuming, and our cold
start experiments required many repeated runs (~100 runs),
we did not perform automatic grid search; rather, we started
from the default hyper-parameter settings and made manual
changes. In code2vec, after observing the training and
validation loss, we set the maximum training epochs to 200,
with the patience of early stopping set to 100, and set the
learning rate to 0.0002. Linear layer and embedding
dimensions were kept at the default value of 100. To ensure
the highest efficiency of the model, we set the batch size
to the full batch. We also tried other values for these
parameters, but observed little change in validation accuracy.
We also manually padded the number of paths to
100 over all code submissions. In ASTNN, we padded the
statement sequences to the maximum length to accommodate
the longest sequence before feeding them to the Bi-GRU. During
training, we used a batch size of 32 and a learning rate of
0.001, and kept the maximum number of training epochs at 50.
The encoding dimension for the statement encoder was 128, and
the number of hidden neurons for the Bi-GRU was 100. The
weights were learned during training
using the Adam optimizer for code2vec and ASTNN models.


4. RESULTS 


4.1 Bug Detection Model Performance 

Figure 3: F1 score of models using supervised strategy with different portions of labeled data in detecting bugs.


In this subsection we address RQ1: How well do state-of-the-art
deep learning models for programming code perform
in a student bug detection task? Table 1 shows the results of
the classifiers in the task of detecting bugs across problems.
We use accuracy, area under the curve (AUC), precision (P),
recall (R), and F1 score as the evaluation metrics for the
detection. Across the three bugs, we observe
that the top-performing model (ASTNN) achieved F1 scores of
0.560, 0.868, and 0.773 on detecting comparison-off-by-one,
assign-in-conditional, and and-vs-or, respectively.
The F1 scores indicate that the models achieved a higher
performance on detecting assign-in-conditional compared
to comparison-off-by-one and and-vs-or.
ASTNN achieved the best F1 score on all three bugs.
On the two logical bugs, ASTNN achieved at least 0.226
higher F1 score than SVM and XGBoost, while code2vec
also achieved at least 0.139 higher, showing that deep learning
models are preferable for detecting the two logical bugs
across the six problems. On the detection of the
assign-in-conditional bug, ASTNN achieved a high F1 score of
0.868, while a simple SVM model was able to achieve a 0.733
F1 score, which is not much lower than the ASTNN model.
However, the recall of SVM is low (0.627), which indicates
a limited capability of detecting the bug among submissions
that contain it. The code2vec model did not achieve better F1
or AUC scores than the SVM or ASTNN model in this case,
showing that in the detection of syntax issues, path
features might be overly complicated. SVM might have a good
performance when the real rule to learn is just “If it has =
instead of == in the code, it has the bug,” since there is little
contextual information to learn. Generally speaking, when
using all 80% of the labeled dataset (~1493 programs on average),
deep learning models have a better performance than
traditional machine learning models in detecting logical bugs,
showing the advantage of leveraging structural information
in the feature extraction step.
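As a reminder of how precision, recall, and F1 relate, the following sketch computes them for binary bug labels. The label vectors are invented toy data and the function names are placeholders. Note that F1 values averaged over splits, as reported in Table 1, need not equal the F1 computed from averaged precision and recall.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (buggy) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy predictions (1 = bug present), invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)

# Plugging in the mean precision and recall of the SVM row for
# comparison-off-by-one from Table 1
svm_f1_from_means = f1_from(0.731, 0.100)
```

The last line gives roughly 0.176, close to but not identical to the reported mean F1 of 0.173, which is consistent with the metrics being averaged over the 10 splits.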


4.2 Bug Detection with Limited Labels 


We address RQ2 in this subsection: How are deep learning
models’ performance impacted by the amount of available
training data? Figure 3 shows the F1 scores of the
four models under the supervised strategy using a subset of
labeled data. The x-axis is the log-scaled labeled data size, and the
y-axis is the F1 score that models achieved across the 10
different splits. The lowest portion of labeled data we use
is 1%, which contains around 15 students, while the highest
portion is 80%. The general trend of the supervised models
shows that when more data is used, better F1 scores can
be achieved by models, especially ASTNN. We also observe
some interruptions in the improvement of performance as
more data becomes available, meaning that it is not guaranteed
that more data generates better models. For the baseline
models, such a data-performance relationship is weaker, but
more data still generally produces better models.


While the models perform better given more
data, we would like to note that among all supervised models,
code2vec achieved better results than other models when using
a small subset of labeled data, showing a warm-start
property. With 10 percent of labeled data, code2vec has at
least 7.5% higher F1 scores than any other model on all
three detected bugs. When more data is used, ASTNN
outperforms other models, showing that there is generally at
least one deep learning model preferable to the baseline
models. When comparing code2vec with ASTNN, we find
that deep learning models are not always “data-hungry”:
although both models are sensitive to data size, code2vec
starts higher than baseline models in classifying all three
bugs. To achieve a good detection result, using 30%-40%
(560-747) less labeled data would create models achieving
80% of the F1 score.


With these results we are able to conclude the answer for 
RQ2: For code2vec and ASTNN, more data would produce 
models with better performance. However, the relation is 
not linear: ASTNN is more “data-hungry” than code2vec, 
but these deep learning models do not require lots of data 
points to perform better than baselines. 


4.3 Application of Semi-supervised learning 
This subsection addresses RQ3: To what extent does
semi-supervised learning improve the performance of the deep
learning models? Figure 4 shows the semi-supervised learning
results for all four models and the comparisons to supervised
ASTNN and code2vec models. The labeled training
data for each split is exactly the same as that used in the
supervised settings. While the results give a mixed signal about
whether semi-supervised learning is beneficial for all models,
we have two observations. 1) Semi-supervised learning
enhanced the learning of deep models, especially ASTNN, on
all three bugs. Comparing the black lines, we found that
solid lines are always higher than dashed ones. This
suggests that ASTNN, as a more “data-hungry” model, benefits
from the semi-supervised strategy more than the other models.
Typically, an ASTNN model trained with a semi-supervised
learning strategy achieves 0.05 to 0.2 higher F1 scores than
those trained with a supervised learning strategy, using the same
training dataset. For code2vec, which is also a deep learning
model, semi-supervised learning does not always help.
It helps code2vec achieve better F1 scores when using a
lower portion of data, but when given more data, a supervised
learning strategy provides better performance.
Semi-supervised learning does not help much for the other two
classical models, compared with deep models. 2) In the
semi-supervised learning scenario, ASTNN achieved a better
performance when using 70% labeled data than when using all 80%
as training data in detecting two bugs, assign-in-conditional
and and-vs-or, by 2.8% and 7.1% respectively (red-circled
in Figure 4). While this may reflect fluctuation in
performance, we did run the models 10 times. This suggests
that the model may be harnessing the semi-supervised learning
strategy to infer labels for unlabeled sets, achieving
more consistent labels than the authors, or that some outliers
are present in the unlabeled set. We assume that the model then
learned on these automatically inferred labels and achieved
better results than learning from all expert labeled data.


Figure 4: F1 score of models using semi-supervised strategy with different portions of labeled data in detecting bugs. Red circles
mark places where the semi-supervised strategy outperformed supervised training with full data.


Our conclusion for RQ3 is that semi-supervised learning of- 
ten improves performance, especially when little training 
data is available. It enables the models to achieve an ex- 
pected performance with less labeled data than the super- 
vised scenario. Specifically, semi-supervised learning helped 
all cases in the learning of ASTNN models, and helped 
code2vec overall as well, especially when data size is low. 


5. DISCUSSION AND CONCLUSION


Our results suggest three primary conclusions: 1) The two
deep learning models generally outperformed baselines, and
ASTNN had the best performance. Our results from Subsection
4.1 show that deep learning models can detect simpler
bugs, but still have limited effectiveness on more
complicated bugs (detailed in Subsection 3.1). The
complexity of the comparison-off-by-one bug may be due to the
difficulty of the labeling process, or its dependency on the
problem context. 2) Deep learning models may still be
successful when labeled data is limited. From the results in
Subsection 4.2, we learn that even when training with a small
data size, such as < 100 data points of complicated programming
data, the code2vec model is still able to outperform baseline
models. 3) Semi-supervised learning has the potential to
help deep learning models perform better. Semi-supervised
learning helped code2vec to achieve a higher performance,
but only when a small number of data points are labeled.
One may assume the difference between the two deep learning
models comes from their structures, but it may also come
from the feature extraction process: code2vec uses path-based
features, whereas ASTNN uses node-based features that are
recursively processed by neural networks.


Our results can have other potential applications in
educational program analysis tasks as well. For example, as
features are automatically extracted from student code during
code2vec or ASTNN training, these features can be used
to help instructors discover new bugs, as suggested by [33],
which can help shape instruction. If more features such as
problem requirements and test case inputs are available, we
can apply these features to the model introduced by [30]
to propagate instructor feedback to all students who would
benefit from it.


This work also has several caveats or limitations at
the current stage. 1) We only performed extensive experiments
on three bugs and drew our conclusions from them.
This is because the dataset labeling is time consuming,
requiring the authors to label ~1800 data points.
The conclusions here may not generalize to other bugs or
code classification tasks. 2) Similarly, these bugs also come
from one programming assignment near the beginning of the
course, focused on if conditions, and thus may be biased to
this specific type of problem. 3) In the splitting process,
we performed stratified sampling, requiring that test,
labeled, and unlabeled data have a similar distribution of class
labels and number of attempts. 4) Since we only compared
our models with two classical model baselines, there
may be other models that achieve better performance.
We made our best effort to select representative models that
achieve state-of-the-art performance, but there might be
better models available for the task as well. This work’s
primary goal is to lay the foundation for using deep models
in this task by exploring whether the “data-hungry” property
also applies here, and the potential applications of
semi-supervised learning. It serves as a step towards future model
designs specific to automatic student bug detection, and provides
guidelines for situations when labeled data is limited.